Add wikilinks NN method for generating embeddings #47
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main      #47      +/-   ##
==========================================
+ Coverage   81.04%   82.11%   +1.06%
==========================================
  Files           4        4
  Lines         575      654      +79
==========================================
+ Hits          466      537      +71
- Misses        109      117       +8
```

Continue to review full report at Codecov.
Results look great, and I'm really happy we have so much progress so quickly! Thanks a lot for all of this :) I'll give this all a read through in the next few days, and I might change the method's name.

Great idea with loading the pre-trained model! That's perfect 😊 I'll try to figure out a way to make it work with non-book inputs, with us potentially just needing to add an optional argument that's needed in case this method's run. I'll likely just work on this branch for a bit and also add in the needed readme changes and documentation before it's merged. That and potentially the tests, although I'll need to look first to see how involved they'll be.

Again though, nice results! Honestly stoked for this :)
Sounds good! I know it is a bit messy as it is right now, so I appreciate you being able to look over it yourself and making any necessary changes. And feel free to change the method's name.

My one thought about working with non-book inputs: I guess you'd have to call in methods from other utility functions that will download the wiki dump and parse a new json. It'd be quite a huge function call!

Let me know if you'd like thoughts on the tests or something!
@victle All looks fine to me :) I'll work on adding to this more tomorrow, and will send along a commit that hopefully will get it working for articles outside of books. I'll then test it and run it through `examples/rec_books`, and with that we can add results and put it to bed :)
If you have an idea of how best to test this, then let me know. I honestly have some pretty baseline tests for this package that are just making sure that returns are the appropriate type and dimensions and that the files can be downloaded.
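The kind of baseline test described above — checking that returns have the appropriate type and dimensions — could be sketched as follows. This is a minimal illustration with a hypothetical `check_embeddings` helper and stand-in data, not wikirec's actual test suite:

```python
import numpy as np

def check_embeddings(embeddings, n_titles, embedding_size):
    """Baseline checks of the kind described: right type and dimensions."""
    assert isinstance(embeddings, np.ndarray)
    assert embeddings.shape == (n_titles, embedding_size)
    assert np.isfinite(embeddings).all()  # no NaNs/infs from training
    return True

# Stand-in output: 100 titles embedded into 50 dimensions
assert check_embeddings(np.zeros((100, 50)), 100, 50)
```

Tests like this don't validate embedding quality, but they do catch broken plumbing (wrong shapes, failed downloads) cheaply in CI.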
src/wikirec/model.py (Outdated)

```python
n_positive = 1024
gen = _generate_batch(pairs, n_positive, negative_ratio = 2)
h = model.fit_generator(gen, epochs = 15, steps_per_epoch = len(pairs) // n_positive)
```
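For context, a `_generate_batch`-style generator in the blogpost approach yields batches of true (book, link) pairs labeled 1 alongside randomly sampled negative pairs. Below is a minimal sketch of that idea — the names, label convention, and index ranges are assumptions for illustration, not wikirec's actual implementation:

```python
import numpy as np

def generate_batch(pairs, n_positive=1024, negative_ratio=2, n_books=100, n_links=200):
    """Yield batches of (book, link) index pairs: true pairs labeled 1,
    randomly sampled non-pairs labeled -1 (for a tanh/hinge-style output)."""
    pairs_set = set(pairs)
    batch_size = n_positive * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))
    rng = np.random.default_rng(0)
    while True:
        # Start each batch with a random sample of true (book, link) pairs
        for idx, (book_id, link_id) in enumerate(
            [pairs[i] for i in rng.choice(len(pairs), n_positive)]
        ):
            batch[idx, :] = (book_id, link_id, 1)
        # Fill the remainder with random pairs that are not true links
        idx = n_positive
        while idx < batch_size:
            book_id = rng.integers(n_books)
            link_id = rng.integers(n_links)
            if (book_id, link_id) not in pairs_set:
                batch[idx, :] = (book_id, link_id, -1)
                idx += 1
        rng.shuffle(batch)
        yield {"book": batch[:, 0], "link": batch[:, 1]}, batch[:, 2]
```

Since the generator loops forever, `steps_per_epoch` (as in the snippet above) is what bounds each epoch.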
@victle Just checking here, `h` is not used as it's just a placeholder for the generator being fit? I'll add a `# pylint: disable=unused-variable` if that's the case :)
@andrewtavis Yep! h is just a placeholder.
@victle, first and foremost, my apologies that this has taken so long. The last few weeks can adequately be described as shit 😄 I've been working on getting this all set up for running on any kind of input, including books. With that being said, I'm currently getting the following error when running:

Darkness and the Light is the last entry in my books json. Have you seen this before, and if so do you have an idea of how to fix it? I might have swapped a variable name in there somewhere as I was switching from "books" to "articles", but then I was mostly search-and-replacing on it. Just wanted to check on this for now. I'll take another look at it if you're not sure.

The basics of the tests are written (I'll just do a simple one to check output lengths as with the others so the process is run through). Hopefully we can get #36 done in the next few days :)

Again, my apologies for not getting to this and further not communicating that it'd be a bit. Hope you're well!
The code that I'm running that's producing the above error is the following:

```python
wikilink_embeddings = model.gen_embeddings(
    method="WikilinkNN",
    path_to_json="./enwiki_books.ndjson",
    path_to_embedding_model="books_embedding_model",
    embedding_size=50,
)
```

Basically I set it up to accept arguments that lead to a given json and a desired embeddings model. There are other changes as well.
No worries on the time! I've honestly been pretty busy too and haven't been able to dedicate enough time 😢 Regarding that error, I don't think I've ever seen it before, nor do I know what might be causing it exactly.
src/wikirec/model.py (Outdated)

```python
# Reshape to be a single number (shape will be (None, 1))
merged = Reshape(target_shape = [1])(merged)

model = Model(inputs = [book, link], outputs = merged)
```
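For reference, in this architecture the two embedding lookups are combined with a dot product and then reshaped to `(None, 1)`. A numpy sketch of that forward pass may make the shapes clearer — the weights, sizes, and `forward` helper here are stand-ins for illustration, not the PR's actual model:

```python
import numpy as np

embedding_size = 50
n_books, n_links = 1000, 5000
rng = np.random.default_rng(42)

# Stand-ins for the two trainable Embedding matrices in the Keras model
book_embeddings = rng.normal(size=(n_books, embedding_size))
link_embeddings = rng.normal(size=(n_links, embedding_size))

def forward(book_ids, link_ids):
    """Dot product of the looked-up embeddings, reshaped to (batch, 1)."""
    book_vecs = book_embeddings[book_ids]            # (batch, embedding_size)
    link_vecs = link_embeddings[link_ids]            # (batch, embedding_size)
    merged = np.sum(book_vecs * link_vecs, axis=1)   # Dot along the embedding axis
    return merged.reshape(-1, 1)                     # Reshape(target_shape=[1])

scores = forward(np.array([0, 1]), np.array([10, 20]))
```

After training, it's the learned embedding matrices (not the dot-product scores) that get extracted and used for recommendations.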
@victle, figured out that this here is supposed to be `book_input` and `link_input` based on comparing it all to the original blogpost and your changes :) Am making good progress on this and will try to have it done by tonight/early tomorrow. Thanks again for all this!
@victle, fully up and running 😊 I added a dot product for WikilinkNN as well. I'm gonna check out combining WikilinkNN with the other methods now (getting rid of the sim_matrix rows).
I'm a bit concerned this might be adding too much at once (and there will be stylistic/usability changes here and there), but I think most of the framework is here. I totally anticipate that changes will need to be made to clean things up. For example, the method only works with the books topic, and cleans data based on that assumption. Below is a list of what I tried to add:

- A new method in `model.gen_embeddings()`, called `wikilinks`.
- When `wikilinks` is used, it will try to load a pre-trained model (I uploaded this under `/examples/wikilinks_embedding_model.h5`). Ideally, this would be part of the repo, and would be sufficient for playing around with the method (although it's on the larger side, > 50 MB).
- `_wikilinks_nn(...)`, which loads the `enwiki_books.ndjson` and builds/trains a neural network model on the wikilinks (this was adapted from the blogpost).
- I haven't run `_wikilinks_nn(..)` all the way through. But the code was taken from my working Google Colab, so I assumed it'd be ok. Let me know if I should run this all the way through before we can merge the PR.

For verification, I did a lot of the testing in a personal Google Colab notebook; here were the recommendations for War and Peace!